ADF Check Sum Computation v1.5.3 RF

The Allotrope Data Format (ADF) [[!ADF]] consists of several APIs and taxonomies. This document constitutes the specification of the ADF Check Sum Computation API for creating hash codes on an ADF file, or parts thereof.

Disclaimer

THESE MATERIALS ARE PROVIDED "AS IS" AND ALLOTROPE EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, INCLUDING, WITHOUT LIMITATION, THE WARRANTIES OF NON-INFRINGEMENT, TITLE, MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

This document is part of a set of specifications on the Allotrope Data Format [[!ADF]].

Introduction

An underlying design principle of the ADF check sum is that a local change (e.g., in a part of a data cube) must not require re-reading the entire file to compute the check sum.

In principle, ADF is designed to support the choice between multiple algorithms to compute the check sum. Currently, there is only one implementation, which depends on the internal representation storage as an HDF5 file. This document describes this algorithm and its application to various component parts of the ADF file.

The document is structured as follows: First, the prerequisites are defined, followed by a detailed description of the hash computation on different parts of an ADF file.

Document Conventions

Namespaces

Within this specification, the following namespace prefix bindings are used:

Prefix       Namespace
owl:         http://www.w3.org/2002/07/owl#
rdf:         http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs:        http://www.w3.org/2000/01/rdf-schema#
xsd:         http://www.w3.org/2001/XMLSchema#
dct:         http://purl.org/dc/terms/
skos:        http://www.w3.org/2004/02/skos/core#
qb:          http://purl.org/linked-data/cube#
adf-dp:      http://purl.allotrope.org/ontologies/datapackage#
adf-dc:      http://purl.allotrope.org/ontologies/datacube#
adf-dc-hdf:  http://purl.allotrope.org/ontologies/datacube-to-hdf5-map#
adf-audit:   http://purl.allotrope.org/ontologies/audit#
lc-hash:     http://id.loc.gov/vocabulary/cryptographicHashFunctions/

Number Formatting

Within this document, decimal numbers will use a dot "." as the decimal mark.

In the following, we often need to represent lists of bytes. For each byte, we use two hexadecimal digits; for several bytes, we simply append those digits.

Nomenclature

We use the terms "check sum" and "hash" synonymously.

Requirements

Preliminaries

Data Types

When converting data to byte strings, different data types must be handled separately. In the rest of the document, we always specify which data type is used when an object is converted to bytes.

Numeric Data Types

The integer data types are written in two's complement representation with "big endian" byte order. The table below shows an example for each type:
Java data type   C# data type   Example value        Bytes
byte             sbyte          45                   2d
short            short          -7498                e2b6
int              int            1318336784           4e943910
long             long           -4895739457839457    ffee9b59d4b3de9f

The floating point data types (float and double) are written according to IEEE 754 with “big endian” byte order.
Java data type   C# data type   Example value   Bytes
float            float          -16e10          d21502f9
double           double         3254e43         4996cc9385c1f043
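
As an illustration of the conversion rules above, the following Java sketch produces the byte representations from the tables using java.nio.ByteBuffer, which writes in big-endian order by default. The class and method names are illustrative only and not part of the ADF API.

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    // Illustrative conversion of the example values above to big-endian bytes.
    public final class BigEndianExamples {

        static byte[] intToBytes(int value) {
            return ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN).putInt(value).array();
        }

        static byte[] longToBytes(long value) {
            return ByteBuffer.allocate(8).order(ByteOrder.BIG_ENDIAN).putLong(value).array();
        }

        static byte[] floatToBytes(float value) {
            // IEEE 754 single precision, big endian
            return ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN).putFloat(value).array();
        }

        static byte[] doubleToBytes(double value) {
            // IEEE 754 double precision, big endian
            return ByteBuffer.allocate(8).order(ByteOrder.BIG_ENDIAN).putDouble(value).array();
        }

        static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder();
            for (byte b : bytes) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            System.out.println(toHex(intToBytes(1318336784))); // 4e943910
            System.out.println(toHex(floatToBytes(-16e10f)));  // d21502f9
        }
    }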

Strings

We encode strings by first writing the number of characters of that string as int (see above), followed by the string itself encoded as UTF-8.

Example: "Hällo World!" (umlaut ä instead of e is intended for the example) is encoded to the bytes 0000000c48c3a46c6c6f20576f726c6421. The first four bytes, 0000000c, encode the number of characters (12), and the next 13 bytes are the UTF-8 encoding of the string. Note that the umlaut ä is encoded to 2 bytes (c3a4) after the H (48), which leads to one byte more than the length of the string.

Message Digest Algorithms

ADF uses existing message digest algorithms to convert a stream of bytes to a check sum.

Currently supported are MD2, MD5, SHA-1, SHA-256, SHA-384, and SHA-512. The check sums in ADF (in general and for the DataPackage) are by default intended to ensure the integrity of the ADF file against accidental storage or transfer errors. To achieve this with the best performance, the default algorithm is MD5. MD5 is no longer considered cryptographically secure, but it remains one of the best candidates for this intended purpose. Users can, however, choose a configuration with one of the other algorithms. The check sum service can be configured per ADF file when activating hashing for the file. For DataPackages, the algorithm can be set individually for each DataPackage file created.

When we state that data is added to a message digest, we mean that the data is converted to bytes (by the rules above), and then the resulting bytes are sequentially added to the digest algorithm. This must yield the same result as constructing an array of bytes that is filled by the outcomes of the conversions and then running the message digest algorithm on it.
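
The following Java snippet illustrates this equivalence with java.security.MessageDigest; the input values are arbitrary example data, and SHA-256 is one of the supported algorithms.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Arrays;

    public final class DigestEquivalenceExample {
        public static void main(String[] args) throws NoSuchAlgorithmException {
            byte[] part1 = "first".getBytes(StandardCharsets.UTF_8);
            byte[] part2 = "second".getBytes(StandardCharsets.UTF_8);

            // Sequential updates of the digest
            MessageDigest sequential = MessageDigest.getInstance("SHA-256");
            sequential.update(part1);
            sequential.update(part2);
            byte[] a = sequential.digest();

            // One-shot digest over the concatenated byte array
            byte[] concatenated = new byte[part1.length + part2.length];
            System.arraycopy(part1, 0, concatenated, 0, part1.length);
            System.arraycopy(part2, 0, concatenated, part1.length, part2.length);
            byte[] b = MessageDigest.getInstance("SHA-256").digest(concatenated);

            System.out.println(Arrays.equals(a, b)); // true
        }
    }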

Combining hash values hierarchically

We use a hierarchical approach to combine the hash values of several parts of the ADF file.

In ADF, we use this mechanism at several levels to minimize computation costs.

For the sake of brevity, we use a fictitious digest algorithm, SHA-32, for examples, where we just take the first 4 bytes of the SHA-256 algorithm’s result.
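
The following Java sketch shows only the combination mechanism, using the fictitious SHA-32; which values are fed into each level of the hierarchy is defined in the sections below.

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Arrays;

    public final class HierarchicalHashExample { // illustrative only

        // Fictitious SHA-32: the first 4 bytes of the SHA-256 result.
        static byte[] sha32(byte[] data) throws NoSuchAlgorithmException {
            return Arrays.copyOf(MessageDigest.getInstance("SHA-256").digest(data), 4);
        }

        // The parent hash is computed over the children's hash values,
        // not over the children's raw data.
        static byte[] combine(byte[]... childHashes) throws NoSuchAlgorithmException {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            for (byte[] childHash : childHashes) {
                digest.update(childHash);
            }
            return Arrays.copyOf(digest.digest(), 4);
        }
    }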

Configuration

In an ADF file's meta model, the file itself is represented by a resource ?F, which is of type hdf:File. If hashing is enabled, the meta-data model will contain the statement

      ?F adf-audit:hasDigestMethod  [
         a adf-audit:DigestMethod;
         adf-audit:hasCanonicalizationAlgorithm adf-audit:c14n-adf-hdf-1.0 ;
         adf-audit:hasDigestAlgorithm lc-hash:sha256
      ]
    
where adf-audit:c14n-adf-hdf-1.0 refers to the chosen algorithm and lc-hash:sha256 to the message digest algorithm used to calculate a digest from a byte stream.
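
This specification does not prescribe an API for reading the configuration. As an illustration only, the following sketch uses Apache Jena to extract the two algorithm references from a meta-data model that already contains the statement above; the class name and the way the model and the ?F resource are obtained are assumptions, and only the property URIs are taken from this document.

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.Property;
    import org.apache.jena.rdf.model.Resource;

    public final class DigestMethodConfigReader { // hypothetical helper

        private static final String ADF_AUDIT = "http://purl.allotrope.org/ontologies/audit#";

        // "model" is the meta-data model, "file" the resource ?F of type hdf:File.
        static void printDigestMethod(Model model, Resource file) {
            Property hasDigestMethod = model.createProperty(ADF_AUDIT, "hasDigestMethod");
            Property hasC14nAlgorithm = model.createProperty(ADF_AUDIT, "hasCanonicalizationAlgorithm");
            Property hasDigestAlgorithm = model.createProperty(ADF_AUDIT, "hasDigestAlgorithm");

            Resource method = file.getPropertyResourceValue(hasDigestMethod);
            if (method != null) {
                System.out.println("Canonicalization: " + method.getPropertyResourceValue(hasC14nAlgorithm));
                System.out.println("Digest algorithm: " + method.getPropertyResourceValue(hasDigestAlgorithm));
            }
        }
    }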

Currently, there are two implementations of an ADF hash algorithm, one of which is only available for read-only access to older files to support backward compatibility. The only choice currently available when initializing check sums for a file is presented below. The configuration will allow other implementations to be used in the future.

Valid options for the message digest algorithm are the supported algorithms listed above (MD2, MD5, SHA-1, SHA-256, SHA-384, and SHA-512).

The file's hash value is stored in the HDF5 root group's attribute ADF_CHECKSUM, encoded as a hexadecimal string.
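
A minimal sketch of such a hexadecimal encoding is shown below; the method name is hypothetical, and lower-case digits are assumed here, matching the byte notation used in this document.

    public final class ChecksumHexExample { // hypothetical helper

        // Encodes a digest as a hexadecimal string, two digits per byte.
        static String toHexString(byte[] checksum) {
            StringBuilder sb = new StringBuilder(checksum.length * 2);
            for (byte b : checksum) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        }
    }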

Specification of the algorithm ADF-HDF-2.0

The algorithm's URI is adf-audit:c14n-adf-hdf-2.0. This is the current default implementation for the check sums.

The algorithm is specific to the usage of HDF5 as the underlying file format and internal representation of the data structure. In particular, the serialization of the data description is not deterministic in the sense that the same content of the data description might result in different binary representations in the file, depending on the order of additions or removals.

The computation of the hash value is based on the structure of the ADF file's HDF5 representation. We define the computation of hashes for HDF5 datasets and HDF5 groups. The overall hash value of the file is defined as the HDF5 root group's hash value.

For each HDF5 group or dataset, the computed check sum is stored in an attribute ADF_CHECKSUM. This makes it possible to re-compute the complete check sum when only a part of the file has been changed. During verification, it also makes it possible to detect in which part of the file data has been corrupted.

Computation of HDF5 group hash values

The hash value of an HDF5 group is based on its attributes and its child elements (other groups and datasets). The following values are added to the message digest algorithm (a sketch of the procedure follows the list):

  1. If the group is not the root group, the group's name is added as a UTF-8 string
  2. If the group has any attributes that are not excluded (see below), the UTF-8 string "attributes" is added.
  3. The attributes (except for the excluded ones) are sorted by their name, and each is added to the hash by
    1. Adding its name as a UTF-8 string
    2. Adding its value, depending on its HDF5 type:
      • HDF5 integers (type class H5T.INTEGER whose size is smaller than 4, or whose size is 4 and whose sign is not 0) are encoded as integers.
      • HDF5 integers (type class H5T.INTEGER which do not fulfill the conditions of the previous item and whose size is smaller than 8, or whose size is 8 and whose sign is not 0) are encoded as longs.
      • HDF5 strings (type class H5T.STRING) are encoded as UTF-8 strings.
  4. If the group has any child elements, the UTF-8 string "elements" is added.
  5. The group's child elements are sorted by their name; for each child that is not excluded (see below) we add:
    1. The name of the child as a UTF-8 string
    2. The bytes of the child's hash value. If the child is a group, that value is computed recursively.
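
The following Java sketch summarizes this procedure. The Node and Group interfaces are hypothetical placeholders for an HDF5 binding and not part of ADF; byte conversions follow the rules from the Data Types section, MD5 is used as the default digest algorithm, and the exclusion rules are simplified to a name check.

    import java.nio.ByteBuffer;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.TreeMap;

    public final class GroupHashSketch {

        // Hypothetical placeholders for an HDF5 binding.
        interface Node { String name(); byte[] hash() throws Exception; }
        interface Group extends Node {
            boolean isRoot();
            Map<String, Object> attributes(); // attribute name -> Integer, Long or String value
            List<Node> children();            // child groups and datasets
        }

        static final Set<String> EXCLUDED_ATTRIBUTES =
                Set.of("ADF_CHECKSUM", "checksum-adf-hdf-2.0", "adf-hdf-checksum-algorithm");

        static byte[] groupHash(Group group) throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");   // default digest algorithm
            if (!group.isRoot()) {
                addString(md, group.name());                       // step 1: group name
            }
            Map<String, Object> attrs = new TreeMap<>(group.attributes());
            attrs.keySet().removeAll(EXCLUDED_ATTRIBUTES);
            if (!attrs.isEmpty()) {
                addString(md, "attributes");                       // step 2: marker string
            }
            for (Map.Entry<String, Object> a : attrs.entrySet()) { // step 3: sorted attributes
                addString(md, a.getKey());
                addValue(md, a.getValue());
            }
            Map<String, Node> children = new TreeMap<>();
            for (Node child : group.children()) {
                // skip the excluded /check-sums group (directly under the root), simplified to a name check
                if (!"check-sums".equals(child.name())) {
                    children.put(child.name(), child);
                }
            }
            if (!children.isEmpty()) {
                addString(md, "elements");                         // step 4: marker string
            }
            for (Map.Entry<String, Node> c : children.entrySet()) { // step 5: sorted children
                addString(md, c.getKey());
                md.update(c.getValue().hash());                    // recursive for child groups
            }
            return md.digest();
        }

        // Character count as a big-endian int, followed by the UTF-8 bytes.
        static void addString(MessageDigest md, String s) {
            md.update(ByteBuffer.allocate(4).putInt(s.length()).array());
            md.update(s.getBytes(StandardCharsets.UTF_8));
        }

        // Attribute values are added as int, long or UTF-8 string (see above).
        static void addValue(MessageDigest md, Object value) {
            if (value instanceof Integer i) {
                md.update(ByteBuffer.allocate(4).putInt(i).array());
            } else if (value instanceof Long l) {
                md.update(ByteBuffer.allocate(8).putLong(l).array());
            } else {
                addString(md, value.toString());
            }
        }
    }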

Excluded attributes

Attributes with the following names are not included in any hash:
  1. ADF_CHECKSUM
  2. checksum-adf-hdf-2.0
  3. adf-hdf-checksum-algorithm

Excluded groups

The group /check-sums and any sub-group of it will not be included in the computation of the hash value.

Computation of HDF5 dataset hash values

For every non-scalar data set, a check sum data set with the same number of dimensions is generated. It has the same path and name as the original dataset, but is located under the HDF5 group /check-sums. A scalar dataset is one without any dimensions; it contains just a single value.

For every dimension i, a number hashblock_i is chosen that describes the number of elements that are grouped together in that direction to compute a single hash value.

We denote the size of the data set in dimension i by size_i.

In the hash data set, we store a check sum for each such block of elements. Each dimension i of the hash data set is of size hashsize_i = ceil(size_i / hashblock_i), i.e. the quotient size_i / hashblock_i rounded up.

How hashblock_i is chosen is up to the implementation. The chosen values are stored in the HDF5 attribute hash_block_size as a string containing the comma-separated block sizes.
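
As a small illustration, the following hypothetical helper computes the hash data set dimensions by integer ceiling division and formats the hash_block_size attribute value.

    public final class HashBlockSketch { // hypothetical helper

        // hashsize_i = ceil(size_i / hashblock_i) for each dimension i
        static long[] hashSizes(long[] sizes, long[] hashBlocks) {
            long[] hashSizes = new long[sizes.length];
            for (int i = 0; i < sizes.length; i++) {
                hashSizes[i] = (sizes[i] + hashBlocks[i] - 1) / hashBlocks[i]; // integer ceiling
            }
            return hashSizes;
        }

        // Comma-separated block sizes, e.g. "1000,500"
        static String hashBlockSizeAttribute(long[] hashBlocks) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < hashBlocks.length; i++) {
                if (i > 0) {
                    sb.append(',');
                }
                sb.append(hashBlocks[i]);
            }
            return sb.toString();
        }
    }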

Each hash value of a block is the result of computing the message digest of the following values:

The data set used to store the check sums is of data type byte; the last dimension is multiplied by the length of the resulting digest (e.g. 32 for SHA-256) to store all bytes of the hash in the data set.

The overall check sum for the data set is computed by adding the following data to the message digest algorithm (a sketch follows the list):

  1. For each dimension, the size of the hash digest is added (as long)
  2. We iterate over all coordinates of hashes, starting with the last dimension. For each coordinate, we add the hash contained in the hash data set to the message digest. In the case of a scalar dataset, there is just one block with one value, which is taken directly from the source dataset because no hash data set has been created.
  3. The attributes of the original data set are sorted by their name and for each attribute (except those listed above) we add to the digest:
    1. The name of the attribute (a string)
    2. The value of the attribute (depending on the attribute type, a string, an integer, or a long value)
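
A sketch of this combination step for a two-dimensional, non-scalar data set is shown below, reusing the helpers from the group-hash sketch above. The HashDataset interface and its hashAt accessor, the reading of step 1 as the hash data set's size in each dimension, and the iteration order (last dimension varying fastest) are assumptions made for illustration.

    import java.nio.ByteBuffer;
    import java.security.MessageDigest;
    import java.util.Map;
    import java.util.TreeMap;

    public final class DatasetHashSketch { // illustrative only

        // Hypothetical accessor for the stored block hashes of a 2-dimensional data set.
        interface HashDataset {
            long[] dimensions();               // sizes of the hash data set
            byte[] hashAt(long row, long col); // stored block hash at the given coordinate
        }

        static byte[] overallChecksum(HashDataset hashes, Map<String, Object> attributes)
                throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5"); // default digest algorithm
            long[] dims = hashes.dimensions();
            for (long dim : dims) {                              // 1. dimension sizes as long
                md.update(ByteBuffer.allocate(8).putLong(dim).array());
            }
            for (long row = 0; row < dims[0]; row++) {           // 2. all block hashes
                for (long col = 0; col < dims[1]; col++) {       //    last dimension varies fastest
                    md.update(hashes.hashAt(row, col));
                }
            }
            // 3. sorted, non-excluded attributes of the original data set
            Map<String, Object> attrs = new TreeMap<>(attributes);
            attrs.keySet().removeAll(GroupHashSketch.EXCLUDED_ATTRIBUTES);
            for (Map.Entry<String, Object> a : attrs.entrySet()) {
                GroupHashSketch.addString(md, a.getKey());
                GroupHashSketch.addValue(md, a.getValue());
            }
            return md.digest();
        }
    }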

Specification of the algorithm ADF-HDF-1.0

The algorithm's URI is adf-audit:c14n-adf-hdf-1.0.

Please note: This algorithm is deprecated and only documented here for reference to support older ADF files. Current implementations do not support writing files with it and future versions might drop support completely.

The algorithm is specific to the usage of HDF5 as the underlying file format and internal representation of the data structure. In particular, the serialization of the data description is not deterministic, in the sense that the same content of the data description might result in different binary representations in the file, depending on the order of additions or removals.

To compute the overall hash code, a hash code is computed for each separate part of the ADF file.

The hash value of each part is added to the input of the message digest that computes the overall check sum.

Computation of the data cube part

The overall hash for the data cubes is computed by adding

for each data cube to the message digest algorithm.

Information on the hash is stored in the meta-data for the cube ?Cube with

            ?Cube adf-audit:hasDigest  [
               adf-audit:hasDigestMethod lc-hash:sha256;
               adf-audit:digestValue "..."^^xsd:base64Binary
            ]
          
where "..." contains the base-64-encoded hash value of the data cube.

Computation of single data cube's hash

To compute the hash of a data cube, we first determine from the meta-data the measure datasets and scale datasets that are used to store the content of the data cube. The RDF resources representing the data sets are sorted alphabetically by their URI. The hash value of each data set is added to the message digest. How the hash value of a data set is computed is described below.

If the data cube contains strings or IRIs in its measurements or scales, the cube contains a dictionary with a representation of those strings. If a dictionary is present for the data cube, all data sets of the dictionary are sorted alphabetically by the URI of the RDF resources that represent them. For these datasets, additional attributes of the hash data set are included in the hash value (see below).

Computation of the data package part

For computing the hash value of the data package, we only consider the data sets of the files, not the directory layout and other meta-data like time of last change. The directory layout and meta-data are stored in the technical model of the ADF file, whose hash computation is described below.

Each file in the data description is backed by an HDF5 dataset. First, we take the RDF representation of all files (resources whose type is adf-dp:File) and sort them by their URI. For every file, we add

Computation of the data description part

Several HDF5 datasets are used to store the information. Their hash values are added to the message digest for the computation of the overall data description hash. The following list contains the data sets whose hash values are added, with attribute names that must be included in the hash (see below).

Path               Attributes     Include size attribute of group
dictionary/bytes   nextId         yes
dictionary/keys    nextId         no
dictionary/nodes   nextId         no
nodes_GSPO/nodes   nextId         yes
nodes_GPOS/nodes   nextId         yes
nodes_GOSP/nodes   nextId         yes
nodes_SPOG/nodes   nextId         yes
nodes_POSG/nodes   nextId         yes
nodes_OSPG/nodes   nextId         yes
quads              nextId, size   no

The path given in the table is relative to the HDF5 group /data-description, e.g. /data-description/dictionary/bytes.

For each data set in the list, the following data is added to the message digest:

The hash values of the data sets are not stored in the data description itself. They are stored as a hexadecimal string in the HDF5 attribute "CHECKSUM" instead.

Computation of the audit part

Like the data description, the audit store contains RDF graphs. To compute the audit's check sum, we first add the check sums of the dictionary's datasets analogously to the data description, except that the path is relative to /audit-trail.

Then, all data sets that are included in any audit trail are sorted alphabetically by their URI and for each data set

Computing Check Sums of Data Sets

In the current version of ADF, the file format HDF5 is used as the underlying storage solution. Most of the information in an ADF file is stored in HDF5 datasets. This section describes how the check sum of a data set is computed.

Because data sets can contain large amounts of data, we again apply the approach of combining multiple hashes into one overall hash.

A data set has n dimensions and a data type. Each entry of the data set has that data type. In ADF, the data types byte, short, int, long, float, and double can be used.

When we use coordinates of elements of a data set in the following, we start with 0 (not 1).

Constructing a Data Set for Check Sums

For this version of the algorithm, check sums data sets are created according to the section on the more recent algorithm, with a few deviations:
  • The path to the check sum data set is stored in the RDF metadata as described below.
  • The hash block size is stored in the RDF metadata instead of an attribute as described below.
  • Only certain attributes are added to the hash, as defined in the section about hashing the data description above.

Meta-information for the stored datasets

Information about data sets that store hash values is stored as RDF data in the technical data description. Assume that a data set is represented by the resource adf://some_dataset. A resource representing the hash data set looks like the following:
        <adf://some_arbitrary_uri> a adf-hash:HashDataSet;
          adf-audit:hasDigest [
              a adf-audit:Digest;
              adf-hash:checksumDatasetOf <adf://some_dataset>;
              adf-audit:hasDigestMethod [
                  a adf-audit:DigestMethod;
                  adf-hash:hashDimension [
                      a adf-hash:HashDimensionDescription;
                      hdf:order 1 ;
                      adf-hash:hashBlockSize 1000;
                  ];
                  adf-hash:hashDimension [
                      a adf-hash:HashDimensionDescription;
                      hdf:order 2;
                      adf-hash:hashBlockSize 500;
                  ];
                  adf-audit:hasDigestAlgorithm lc-hash:sha256;
              ]
          ]
      
The check sum data set shown has two dimensions; the number of elements hashed per block is 1000 in the first dimension and 500 in the second. The digest algorithm used to aggregate the data, SHA-256, is also specified.

Change History

Version Release Date Remarks
1.2.0 2016-12-07
  • Initial version
1.3.0 Preview 2017-03-31
  • Updated versions and dates
1.3.0 RC 2017-05-05
  • Updated versions and dates
  • Updated standard algorithm
1.3.0 RF 2017-06-30
  • Updated versions and dates
  • Updated document to describe changes in ADF 1.3.0 RF
1.4.2 2018-01-25
  • Updated versions and dates
  • Updated document to describe changes in ADF 1.4.2
1.4.3 RC 2018-10-11
  • Updated versions and dates
1.4.5 RF 2018-12-17
  • Updated versions and dates
1.5.0 RC 2019-12-12
  • Updated versions and dates
1.5.0 RF 2020-03-03
  • Updated HDF5 reference link
1.5.3 RF 2020-11-30
  • Updated broken reference links
  • Updated PURL and DOCS server links to relative links
  • Reformat the document header